Introduction

This assignment is based on this 2D object detection tutorial, which uses PyTorch to implement the SSD network to detect objects in images from the VOC dataset: https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection

Download dataset and create json files

If you already have the dataset downloaded and the JSON files created, only the mount portion has to be run.

First, we mount our Google Drive.
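In a Colab notebook, mounting is a one-liner via `google.colab.drive`. A small sketch, wrapped in a helper so it is a no-op outside Colab (the helper name is ours, not the tutorial's):

```python
def mount_drive(mountpoint="/content/drive"):
    """Mount Google Drive when running inside a Colab runtime.

    Returns True if the drive was mounted, False when google.colab is
    unavailable (i.e. we are not running in Colab).
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mountpoint)
    return True
```

After mounting, the dataset and JSON files can live under `/content/drive` so they persist across runtime restarts.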

Next, download the VOC 2007 dataset. This takes about 6 minutes.

Sync the data to your Google Drive. This should take about 33 minutes. Afterwards, you must restart the runtime by clicking Runtime -> Restart runtime.

Check that the data has been downloaded and that you have the JSON files. This also remounts Google Drive.

This code does not have to be run; the files it creates are provided with the assignment. It creates the JSON files label_map.json, TRAIN_images.json, TRAIN_objects.json, TEST_images.json, and TEST_objects.json, which contain the image paths, the ground-truth object information, and the label-to-number mapping. This should take about 45 minutes if the data has not been cached.
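To make the label-to-number mapping concrete, here is a sketch of how label_map.json can be built from the 20 VOC class names, with 0 reserved for the background class (the exact ordering in the provided file follows the tutorial):

```python
# The 20 Pascal VOC object classes.
voc_labels = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
              'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
              'motorbike', 'person', 'pottedplant', 'sheep', 'sofa',
              'train', 'tvmonitor')

# Map each class name to an integer label; 0 is reserved for background.
label_map = {k: v + 1 for v, k in enumerate(voc_labels)}
label_map['background'] = 0
```

Serialized with `json.dump`, this dictionary is exactly the kind of content stored in label_map.json.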

Create the VOC Dataset loader

Next, the Dataset loader for VOC is implemented.
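The loader is a standard `torch.utils.data.Dataset` that reads the JSON files created earlier. A minimal sketch (the real loader also opens each image with PIL and applies the data-augmentation transforms, which are elided here):

```python
import json
import os

from torch.utils.data import Dataset


class PascalVOCDataset(Dataset):
    """Sketch of a VOC loader reading the TRAIN/TEST JSON files."""

    def __init__(self, data_folder, split):
        self.split = split.upper()  # 'TRAIN' or 'TEST'
        with open(os.path.join(data_folder, self.split + '_images.json')) as f:
            self.images = json.load(f)    # list of image file paths
        with open(os.path.join(data_folder, self.split + '_objects.json')) as f:
            self.objects = json.load(f)   # per-image boxes and labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        # The real loader opens self.images[i] with PIL, converts the
        # annotations to tensors, and applies transforms; this sketch
        # just returns the raw path and annotation dict.
        return self.images[i], self.objects[i]
```

Because each image can contain a different number of objects, the real implementation also supplies a custom `collate_fn` to the DataLoader.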

Model Implementation

Base layers

First, we create the base, or encoder, part of the network.

You must fill in the ResNet code.

Auxiliary layers

The base layers produce the low-level feature maps with 512 and 1024 channels. Now the higher-level feature maps with 512, 256, 256, and 256 channels are created.
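The auxiliary block alternates 1x1 channel-reduction convolutions with 3x3 convolutions (strided or unpadded) to shrink the spatial size step by step. A sketch with the channel sizes from the tutorial; treat it as illustrative, not the graded solution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AuxiliaryConvolutions(nn.Module):
    """Sketch of the SSD auxiliary convolutions on top of the base output."""

    def __init__(self):
        super().__init__()
        self.conv8_1 = nn.Conv2d(1024, 256, kernel_size=1)
        self.conv8_2 = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)  # -> (N, 512, 10, 10)
        self.conv9_1 = nn.Conv2d(512, 128, kernel_size=1)
        self.conv9_2 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)  # -> (N, 256, 5, 5)
        self.conv10_1 = nn.Conv2d(256, 128, kernel_size=1)
        self.conv10_2 = nn.Conv2d(128, 256, kernel_size=3)                      # -> (N, 256, 3, 3)
        self.conv11_1 = nn.Conv2d(256, 128, kernel_size=1)
        self.conv11_2 = nn.Conv2d(128, 256, kernel_size=3)                      # -> (N, 256, 1, 1)

    def forward(self, conv7_feats):
        # conv7_feats: (N, 1024, 19, 19), the last base feature map
        conv8_2_feats = F.relu(self.conv8_2(F.relu(self.conv8_1(conv7_feats))))
        conv9_2_feats = F.relu(self.conv9_2(F.relu(self.conv9_1(conv8_2_feats))))
        conv10_2_feats = F.relu(self.conv10_2(F.relu(self.conv10_1(conv9_2_feats))))
        conv11_2_feats = F.relu(self.conv11_2(F.relu(self.conv11_1(conv10_2_feats))))
        return conv8_2_feats, conv9_2_feats, conv10_2_feats, conv11_2_feats
```

Note how the stride-2 padded convolutions halve the 19x19 map to 10x10 and then 5x5, while the final unpadded 3x3 convolutions take 5x5 to 3x3 and 3x3 to 1x1.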

Prediction layers

At this point we have our 6 feature maps.

The low level feature maps: (N, 512, 38, 38), (N, 1024, 19, 19)

Also the high level feature maps: (N, 512, 10, 10), (N, 256, 5, 5), (N, 256, 3, 3), (N, 256, 1, 1)

Each prior box requires a classification output with one score per class, plus the 4 regressed box location values. These convolutions are created in the init function.

In the forward pass, all the convolutions are applied to their respective input feature maps. The resulting tensors are then reshaped and concatenated so that the classification output has shape (N, 8732, n_classes) and the box output has shape (N, 8732, 4). This format is easier to work with when the network output is passed to the loss function during training or through NMS during testing.
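The tensor bookkeeping can be sketched as follows: each per-feature-map output of shape (N, priors_per_cell * k, H, W) is permuted to channels-last and flattened to (N, H * W * priors_per_cell, k) before concatenation (the function name is ours, for illustration):

```python
import torch


def flatten_predictions(loc_maps, cls_maps, n_classes):
    """Reshape per-feature-map conv outputs into (N, total_priors, 4)
    and (N, total_priors, n_classes). Tensor bookkeeping sketch only."""
    locs, scores = [], []
    for l, c in zip(loc_maps, cls_maps):
        n = l.size(0)
        # (N, 4*priors, H, W) -> (N, H, W, 4*priors) -> (N, H*W*priors, 4)
        locs.append(l.permute(0, 2, 3, 1).contiguous().view(n, -1, 4))
        # Same reshuffle for the class scores, k = n_classes per prior.
        scores.append(c.permute(0, 2, 3, 1).contiguous().view(n, -1, n_classes))
    # Concatenate across the 6 feature maps along the prior dimension.
    return torch.cat(locs, dim=1), torch.cat(scores, dim=1)
```

With 4, 6, 6, 6, 4, and 4 priors per cell on the 38, 19, 10, 5, 3, and 1 sized maps, the prior dimension sums to 8732.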

The SSD300 Model

init - Defines all network layers and creates the prior boxes

create_prior_boxes - Creates the 8732 prior boxes across the 6 feature maps

forward - Sends the input data through the three network components and returns the predicted locations and classification scores.

detect_objects - During testing, the predictions from a forward pass can be sent to this function to perform NMS and produce the final output.
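Of these, create_prior_boxes is the most mechanical: for every cell of every feature map it emits one box per aspect ratio, plus an extra square box at the geometric mean of adjacent scales. A sketch using the scales and ratios from the tutorial:

```python
from math import sqrt

import torch


def create_prior_boxes():
    """Sketch of SSD300 prior-box generation.

    Returns an (8732, 4) tensor of (cx, cy, w, h) boxes in [0, 1].
    """
    fmap_dims = {'conv4_3': 38, 'conv7': 19, 'conv8_2': 10,
                 'conv9_2': 5, 'conv10_2': 3, 'conv11_2': 1}
    obj_scales = {'conv4_3': 0.1, 'conv7': 0.2, 'conv8_2': 0.375,
                  'conv9_2': 0.55, 'conv10_2': 0.725, 'conv11_2': 0.9}
    aspect_ratios = {'conv4_3': [1., 2., 0.5],
                     'conv7': [1., 2., 3., 0.5, 1. / 3.],
                     'conv8_2': [1., 2., 3., 0.5, 1. / 3.],
                     'conv9_2': [1., 2., 3., 0.5, 1. / 3.],
                     'conv10_2': [1., 2., 0.5],
                     'conv11_2': [1., 2., 0.5]}
    fmaps = list(fmap_dims.keys())

    prior_boxes = []
    for k, fmap in enumerate(fmaps):
        for i in range(fmap_dims[fmap]):
            for j in range(fmap_dims[fmap]):
                # Box center, in fractional image coordinates.
                cx = (j + 0.5) / fmap_dims[fmap]
                cy = (i + 0.5) / fmap_dims[fmap]
                for ratio in aspect_ratios[fmap]:
                    prior_boxes.append([cx, cy,
                                        obj_scales[fmap] * sqrt(ratio),
                                        obj_scales[fmap] / sqrt(ratio)])
                    if ratio == 1.:
                        # Extra prior: geometric mean of this scale and
                        # the next feature map's scale (1 for the last map).
                        try:
                            extra = sqrt(obj_scales[fmap] * obj_scales[fmaps[k + 1]])
                        except IndexError:
                            extra = 1.
                        prior_boxes.append([cx, cy, extra, extra])
    return torch.FloatTensor(prior_boxes).clamp_(0, 1)
```

This gives 4 priors per cell on conv4_3, 6 on conv7/conv8_2/conv9_2, and 4 on conv10_2/conv11_2, totaling 8732.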

TODO: Answer the following questions after reading the NMS code and comparing it to the version in the lecture notes / tutorial.


  1. Which variables within the batch_size for loop represent "D" and "$\bar{B}$"?

  2. The NMS pseudocode is written with operations such as union and set subtraction. Within the NMS Python code, how are boxes selected to be added to the "D" output?


The MultiBoxLoss

During training, the output from the SSD forward pass is sent to the criterion (set to this function) to calculate the loss.
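The loss combines a localization term over positive (object-matched) priors with a confidence term over positives plus the hardest negatives. A simplified sketch that assumes priors are already matched to ground truth (class 0 meaning background); the prior-matching step and per-image negative mining of the real implementation are elided:

```python
import torch
import torch.nn.functional as F


def multibox_loss(pred_locs, pred_scores, true_locs, true_classes,
                  alpha=1.0, neg_pos_ratio=3):
    """Simplified MultiBox loss sketch.

    pred_locs:    (N, n_priors, 4) predicted box offsets
    pred_scores:  (N, n_priors, n_classes) class logits
    true_locs:    (N, n_priors, 4) matched ground-truth offsets
    true_classes: (N, n_priors) matched labels, 0 = background
    """
    positive = true_classes != 0                       # (N, n_priors)
    n_pos = int(positive.sum().clamp(min=1))

    # Localization loss: smooth L1 over positive priors only.
    loc_loss = F.smooth_l1_loss(pred_locs[positive], true_locs[positive],
                                reduction='sum') / n_pos

    # Confidence loss: all positives plus the hardest negatives
    # (hard negative mining keeps neg_pos_ratio negatives per positive).
    n, n_priors, n_classes = pred_scores.shape
    conf_all = F.cross_entropy(pred_scores.view(-1, n_classes),
                               true_classes.view(-1),
                               reduction='none').view(n, n_priors)
    conf_pos = conf_all[positive].sum()
    conf_neg = conf_all.masked_fill(positive, 0.)      # zero out positives
    hardest_neg, _ = conf_neg.view(-1).sort(descending=True)
    conf_loss = (conf_pos + hardest_neg[: neg_pos_ratio * n_pos].sum()) / n_pos

    return conf_loss + alpha * loc_loss
```

Without hard negative mining, the thousands of easy background priors would dominate the confidence term and swamp the signal from the few positives.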

Training

With the model implemented, it is time to train. Training should take about 2 hours and 9 minutes for 10 epochs, or about 1 hour and 5 minutes for 16 epochs on only the VOC2007 dataset.

Training SSD300 with VGG and the original learning rate adjuster

This can be run without making any changes to the code.

Training SSD300 with ResNet and the original learning rate adjuster

This should be run after implementing the ResNet Base.

Training SSD300 with VGG and using a PyTorch learning rate scheduler

This should be run after modifying the training loop to use a learning rate scheduler.
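One way to do this is to replace the tutorial's manual learning-rate adjustment with a built-in PyTorch scheduler such as `MultiStepLR`. A sketch with a stand-in model and hypothetical milestones (the actual epochs at which you decay the rate are up to you):

```python
import torch

# Stand-in for the SSD300 model; the optimizer setup mirrors typical SGD use.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# Decay the learning rate by 10x after epochs 8 and 12 (illustrative choice).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[8, 12], gamma=0.1)

for epoch in range(16):
    # train_one_epoch(model, optimizer)  # actual training step elided
    optimizer.step()       # per-batch updates happen inside the epoch
    scheduler.step()       # scheduler.step() once per epoch, after optimizer.step()
```

After 16 epochs the rate has passed both milestones, so it sits at 1e-3 * 0.1 * 0.1 = 1e-5.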

Testing

Now let's run the eval code; it should take about 30 minutes per model.

Testing SSD300 with VGG and the original learning rate adjuster

Your model should output an mAP about the same as this:

Mean Average Precision (mAP): 0.589

Testing SSD300 with ResNet and the original learning rate adjuster

Testing SSD300 with VGG and using a PyTorch learning rate scheduler

Viewing results

And lastly let's view some images with our detections!
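Drawing the detections comes down to overlaying each predicted box and label on the image. A small sketch with PIL (the function name and argument format are ours; the tutorial's detect function returns boxes it scales back to absolute pixel coordinates, which is what this expects):

```python
from PIL import Image, ImageDraw


def draw_detections(image, boxes, labels):
    """Return a copy of a PIL image with boxes and labels drawn on it.

    boxes:  list of [x_min, y_min, x_max, y_max] in absolute pixels
    labels: list of class-name strings, one per box
    """
    annotated = image.copy()                 # leave the original untouched
    draw = ImageDraw.Draw(annotated)
    for box, label in zip(boxes, labels):
        draw.rectangle(box, outline='red')
        draw.text((box[0], box[1]), label, fill='red')
    return annotated
```

Called on a test image with the output of detect_objects, this produces the annotated images shown at the end of the notebook.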